Google Cloud Datalab provides an integrated environment for working with Google BigQuery, for both ad-hoc exploratory work and pipeline development. You've already seen the use of %%bq in the Hello BigQuery notebook, and various commands in the BigQuery Commands notebook.
These commands are in fact built on the same BigQuery APIs that are available for your own use. This notebook introduces some examples of using those APIs.
In [2]:
import google.datalab.bigquery as bq
The most important BigQuery-related API is the one that allows you to execute a SQL query. The google.datalab.bigquery.Query class provides that functionality. To run a query using BigQuery Standard SQL, create a new Query object with the desired SQL string, or use an object that has already been defined by the %%bq query --name command. Let's take a look at an example:
In [4]:
# Create a SQL query object (it is not executed yet)
httplogs_query = bq.Query('SELECT * FROM `cloud-datalab-samples.httplogs.logs_20140615` LIMIT 3')
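For comparison, the same query object could have been defined with the cell magic mentioned above. A sketch (the --name value is just a label of our choosing):
In [ ]:
%%bq query --name httplogs_query
SELECT * FROM `cloud-datalab-samples.httplogs.logs_20140615` LIMIT 3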
Let's run the query created above with caching turned off, so we're sure to be able to retrieve metadata, such as bytes processed, from the resulting query job. For this, we'll need to use a QueryOutput object.
In [5]:
output_options = bq.QueryOutput.table(use_cache=False)
result = httplogs_query.execute(output_options=output_options).result()
result
Out[5]:
The result object is a QueryResultsTable, and can be enumerated in the same manner as a regular Python list, in addition to providing metadata about the result.
In [4]:
# Inspecting the result, and the associated job
print(result.sql)
print(str(result.length) + ' rows')
print(str(result.job.bytes_processed) + ' bytes processed')
In [5]:
# Inspect the programmatic representation.
# Converting the QueryResultsTable to a vanilla list enables viewing the literal data,
# as well as side-stepping the HTML rendering seen above.
list(result)
Out[5]:
The QueryResultsTable has more functionality you can explore, such as converting it to a pandas DataFrame or exporting it to a file (sketched below).
In [7]:
result.to_dataframe()
Out[7]:
In [8]:
type(result.to_dataframe())
Out[8]:
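Exporting to a file, mentioned above, is a one-line call. A minimal sketch, assuming QueryResultsTable inherits Table.to_file(destination, format) from the Datalab API (the local path is illustrative):
In [ ]:
# Write the result rows to a local CSV file; the path is illustrative.
result.to_file('httplogs_sample.csv', format='csv')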
Queries can also be composed with subqueries. Let's define a query that uses a WITH clause to list the distinct names recorded in 2013 in the USA names public dataset:
In [9]:
UniqueNames2013 = bq.Query(sql='''
  WITH UniqueNames2013 AS
    (SELECT DISTINCT name
     FROM `bigquery-public-data.usa_names.usa_1910_2013`
     WHERE Year = 2013)
  SELECT * FROM UniqueNames2013
''')
To execute the query and view a sample from the result table, use a Sampling object. Let's use random sampling for this example:
In [10]:
sampling = bq.Sampling.random(percent=2)
job = UniqueNames2013.execute(sampling=sampling)
job.result()
Out[10]:
Notice that every time we run the query above, we get a different set of results, since we chose a random sampling of 2%.
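If you need a repeatable sample instead, hashed sampling selects rows based on a hash of a field's value, so repeated runs return the same subset. A sketch, assuming bq.Sampling.hashed(field_name, percent) as in the Datalab API:
In [ ]:
# Deterministic 2% sample keyed on the 'name' field
sampling = bq.Sampling.hashed('name', percent=2)
UniqueNames2013.execute(sampling=sampling).result()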
We can also run the query and copy the sampled result into a pandas DataFrame directly. For that, we use a QueryOutput object of type dataframe:
In [11]:
output_options = bq.QueryOutput.dataframe(max_rows=10)
job = UniqueNames2013.execute(output_options=output_options)
In [12]:
job.result()
Out[12]:
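A QueryOutput can also write results straight to a local file. A sketch, assuming bq.QueryOutput.file(path, format) as in the Datalab API (the path is illustrative):
In [ ]:
# Write the query results to a local CSV file; the path is illustrative.
output_options = bq.QueryOutput.file('unique_names_2013.csv', format='csv')
job = UniqueNames2013.execute(output_options=output_options)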
So far we have been working with queries; the API also lets you enumerate the datasets and tables in a project. Here we pass an explicit Context so the listing covers the cloud-datalab-samples project:
In [13]:
from google.datalab import Context
datasets = bq.Datasets(Context('cloud-datalab-samples', Context.default().credentials))
for ds in datasets:
  print(ds.name)
In [14]:
sample_dataset = list(datasets)[1]
tables = sample_dataset.tables()
for table in tables:
  print('%s (%d rows - %d bytes)' % (table.name.table_id, table.metadata.rows, table.metadata.size))
Tables expose their schema too. Let's look at the field names of one of the sample tables:
In [15]:
table = bq.Table('cloud-datalab-samples.httplogs.logs_20140615')
fields = map(lambda tsf: tsf.name, table.schema)
list(fields)
Out[15]:
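Schema fields carry more than just a name. A short sketch printing each field's declared type, assuming each field exposes .name and .type attributes, as in the Datalab schema field class:
In [ ]:
# Print name/type pairs for every field in the table's schema
# (the .type attribute is an assumption about the Datalab field class)
for field in table.schema:
  print('%s: %s' % (field.name, field.type))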
In [16]:
# Create a new dataset (this will be deleted later in the notebook)
sample_dataset = bq.Dataset('apisample')
sample_dataset.create(friendly_name = 'Sample Dataset', description = 'Created from Sample Notebook')
sample_dataset.exists()
Out[16]:
In [41]:
# To create a table, we also need to create a schema.
# We've seen before how to create a schema from some existing data,
# let's now try creating it from a list of records:
schema = [
  {'name': 'name', 'type': 'STRING'},
  {'name': 'value', 'type': 'INT64'},
  {'name': 'flag', 'type': 'BOOL', 'mode': 'NULLABLE'}
]
sample_schema = bq.Schema.from_data(schema)
sample_table = bq.Table("apisample.sample_table").create(schema = sample_schema, overwrite = True)
In [42]:
sample_table.schema
Out[42]:
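With the table in place, you can stream rows into it. A sketch, assuming Table.insert accepts a list of records matching the schema, as in the Datalab API (streaming inserts can take a few seconds to become visible to queries):
In [ ]:
# Stream two sample records into the new table
# (assumes Table.insert takes a list of record dicts)
sample_table.insert([
  {'name': 'first', 'value': 1, 'flag': True},
  {'name': 'second', 'value': 2, 'flag': False}
])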
You can run the cell below to see the contents of the new dataset:
In [ ]:
list(sample_dataset.tables())
In [44]:
# Clear out sample resources
sample_dataset.delete(delete_contents = True)
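As a quick check, the exists() call we used earlier should now return False:
In [ ]:
# Confirm the dataset was deleted
sample_dataset.exists()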